Examining and improving the effectiveness of relevance feedback for retrieval of scanned text documents

Internal identifier: 001168 (Main/Exploration); previous: 001167; next: 001169

Authors: Adenike M. Lam-Adesina [Ireland]; Gareth J. F. Jones [Ireland]

Source :

Information processing & management [ 0306-4573 ] ; 2006.

RBID : Pascal:06-0291073

French descriptors

Pascal (Inist)
- Document image, Pertinence, Rétroaction, Elargissement requête, Reconnaissance optique caractère, Erreur.

English descriptors

KwdEn :
- Error, Feedback regulation, Image document, Optical character recognition, Query expansion, Relevance.

Abstract

Important legacy paper documents are digitized and collected in online accessible archives. This enables the preservation, sharing, and significantly the searching of these documents. The text contents of these document images can be transcribed automatically using OCR (Optical Character Recognition) systems and then stored in an information retrieval system. However, OCR systems make errors in character recognition which have previously been shown to impact on document retrieval behaviour. In particular relevance feedback query-expansion methods, which are often effective for improving electronic text retrieval, are observed to be less reliable for retrieval of scanned document images. Our experimental examination of the effects of character recognition errors on an ad hoc OCR retrieval task demonstrates that, while baseline information retrieval can remain relatively unaffected by transcription errors, relevance feedback via query expansion becomes highly unstable. This paper examines the reason for this behaviour, and introduces novel modifications to standard relevance feedback methods. These methods are shown experimentally to improve the effectiveness of relevance feedback for errorful OCR transcriptions. The new methods combine similar recognised character strings based on term collection frequency and a string edit-distance measure. The techniques are domain independent and make no use of external resources such as dictionaries or training data.
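
The final step described in the abstract (combining similar recognised character strings using term collection frequency and a string edit-distance measure) can be illustrated with a minimal Python sketch. It is not the authors' implementation, which the abstract does not specify. The function names `edit_distance` and `merge_similar_terms`, the thresholds `max_distance` and `min_length`, the sample frequencies, and the assumption that the more frequent of two similar strings is the correct form are all illustrative choices made here.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def merge_similar_terms(freqs, max_distance=1, min_length=5):
    """Fold each term into the most frequent similar term seen so far.

    Assumption (not from the paper): of two strings within `max_distance`
    edits, the one with the higher collection frequency is the correct form.
    Terms shorter than `min_length` are never merged, since short words
    such as "text" and "test" are often legitimately one edit apart.
    Returns (term -> canonical term, merged frequency counts).
    """
    # Visit terms from most to least frequent (ties broken alphabetically).
    ordered = sorted(freqs, key=lambda t: (-freqs[t], t))
    canonical = {}
    merged = Counter()
    for term in ordered:
        target = term
        if len(term) >= min_length:
            # `merged` keeps insertion order, so candidates are tried
            # from the most frequent canonical term downwards.
            for cand in merged:
                if (len(cand) >= min_length
                        and abs(len(cand) - len(term)) <= max_distance
                        and edit_distance(term, cand) <= max_distance):
                    target = cand
                    break
        canonical[term] = target
        merged[target] += freqs[term]
    return canonical, merged


if __name__ == "__main__":
    # Hypothetical collection frequencies containing OCR-style misrecognitions.
    sample = {"retrieval": 120, "retrieva1": 3, "retrleval": 2,
              "feedback": 80, "feedbaek": 1, "text": 50, "test": 40}
    mapping, counts = merge_similar_terms(sample)
    print(mapping)
    print(counts)
```

With these sample values, the three variants of "retrieval" collapse into one entry with a combined frequency of 125, "feedbaek" is folded into "feedback", and the short words "text" and "test" remain separate. Feeding the merged counts into expansion-term selection, rather than the raw counts, is the kind of use the abstract points to; how the paper actually does this is not stated in this record.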

Affiliations:

Ireland (country)

Links toward previous steps (curation, corpus...)

to stream PascalFrancis, to step Corpus: 000386
to stream PascalFrancis, to step Curation: 000400
to stream PascalFrancis, to step Checkpoint: 000336
to stream Main, to step Merge: 001197
to stream Main, to step Curation: 001168

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Examining and improving the effectiveness of relevance feedback for retrieval of scanned text documents</title>
<author><name sortKey="Lam Adesina, Adenike M" sort="Lam Adesina, Adenike M" uniqKey="Lam Adesina A" first="Adenike M." last="Lam-Adesina">Adenike M. Lam-Adesina</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>School of Computing and Centre for Digital Video Processing, Dublin City University</s1>
<s2>Glasnevin, Dublin</s2>
<s3>IRL</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Irlande (pays)</country>
<wicri:noRegion>Glasnevin, Dublin</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Jones, Gareth J F" sort="Jones, Gareth J F" uniqKey="Jones G" first="Gareth J. F." last="Jones">Gareth J. F. Jones</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>School of Computing and Centre for Digital Video Processing, Dublin City University</s1>
<s2>Glasnevin, Dublin</s2>
<s3>IRL</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Irlande (pays)</country>
<wicri:noRegion>Glasnevin, Dublin</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">06-0291073</idno>
<date when="2006">2006</date>
<idno type="stanalyst">PASCAL 06-0291073 INIST</idno>
<idno type="RBID">Pascal:06-0291073</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000386</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000400</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000336</idno>
<idno type="wicri:doubleKey">0306-4573:2006:Lam Adesina A:examining:and:improving</idno>
<idno type="wicri:Area/Main/Merge">001197</idno>
<idno type="wicri:Area/Main/Curation">001168</idno>
<idno type="wicri:Area/Main/Exploration">001168</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Examining and improving the effectiveness of relevance feedback for retrieval of scanned text documents</title>
<author><name sortKey="Lam Adesina, Adenike M" sort="Lam Adesina, Adenike M" uniqKey="Lam Adesina A" first="Adenike M." last="Lam-Adesina">Adenike M. Lam-Adesina</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>School of Computing and Centre for Digital Video Processing, Dublin City University</s1>
<s2>Glasnevin, Dublin</s2>
<s3>IRL</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Irlande (pays)</country>
<wicri:noRegion>Glasnevin, Dublin</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Jones, Gareth J F" sort="Jones, Gareth J F" uniqKey="Jones G" first="Gareth J. F." last="Jones">Gareth J. F. Jones</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>School of Computing and Centre for Digital Video Processing, Dublin City University</s1>
<s2>Glasnevin, Dublin</s2>
<s3>IRL</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Irlande (pays)</country>
<wicri:noRegion>Glasnevin, Dublin</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">Information processing &amp; management</title>
<title level="j" type="abbreviated">Inf. process. manag.</title>
<idno type="ISSN">0306-4573</idno>
<imprint><date when="2006">2006</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">Information processing &amp; management</title>
<title level="j" type="abbreviated">Inf. process. manag.</title>
<idno type="ISSN">0306-4573</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Error</term>
<term>Feedback regulation</term>
<term>Image document</term>
<term>Optical character recognition</term>
<term>Query expansion</term>
<term>Relevance</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Document image</term>
<term>Pertinence</term>
<term>Rétroaction</term>
<term>Elargissement requête</term>
<term>Reconnaissance optique caractère</term>
<term>Erreur</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Important legacy paper documents are digitized and collected in online accessible archives. This enables the preservation, sharing, and significantly the searching of these documents. The text contents of these document images can be transcribed automatically using OCR (Optical Character Recognition) systems and then stored in an information retrieval system. However, OCR systems make errors in character recognition which have previously been shown to impact on document retrieval behaviour. In particular relevance feedback query-expansion methods, which are often effective for improving electronic text retrieval, are observed to be less reliable for retrieval of scanned document images. Our experimental examination of the effects of character recognition errors on an ad hoc OCR retrieval task demonstrates that, while baseline information retrieval can remain relatively unaffected by transcription errors, relevance feedback via query expansion becomes highly unstable. This paper examines the reason for this behaviour, and introduces novel modifications to standard relevance feedback methods. These methods are shown experimentally to improve the effectiveness of relevance feedback for errorful OCR transcriptions. The new methods combine similar recognised character strings based on term collection frequency and a string edit-distance measure. The techniques are domain independent and make no use of external resources such as dictionaries or training data.</div>
</front>
</TEI>
<affiliations><list><country><li>Irlande (pays)</li>
</country>
</list>
<tree><country name="Irlande (pays)"><noRegion><name sortKey="Lam Adesina, Adenike M" sort="Lam Adesina, Adenike M" uniqKey="Lam Adesina A" first="Adenike M." last="Lam-Adesina">Adenike M. Lam-Adesina</name>
</noRegion>
<name sortKey="Jones, Gareth J F" sort="Jones, Gareth J F" uniqKey="Jones G" first="Gareth J. F." last="Jones">Gareth J. F. Jones</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001168 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001168 | SxmlIndent | more

To add a link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:06-0291073
   |texte=   Examining and improving the effectiveness of relevance feedback for retrieval of scanned text documents
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

	OCR exploration server
	Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.
